SwiftKey develops a word prediction application used while typing on a mobile keyboard. When the user types “I went to the”, the application presents three options for what the next word might be. For example, the three words might be gym, store, and restaurant.
In this project, we use R to build a predictive model from text data provided by the Data Science Capstone course. The data consists of text from ‘Blogs’, ‘News’ and ‘Twitter’ sources, totaling more than 4 million lines and ??? unique words.
In a nutshell, here’s a summary of the data analysis performed in this report.
First, we fetched the data from the URL provided by the course. Here are the line counts per file.
$ wc -l data/final/en_US/*.txt
899288 data/final/en_US/en_US.blogs.txt
1010242 data/final/en_US/en_US.news.txt
2360148 data/final/en_US/en_US.twitter.txt
4269678 total
Next, we sample 1% of the lines in each file in order to speed up the data exploration. The implementation is in sample_capstone_data in sample_data.R. We then use the tm R package to load each sample file for analysis.
sample_vector_corpus <- get_sample_datums_vector_corpus()
content_stats_df <- do_explore_per_data_source(sample_vector_corpus)
content_stats_df
## source num_lines num_unique_words mean_word_freq median_word_freq
## 1 twitter 23602 8040 20 9
## 2 blogs 8993 12414 15 6
## 3 news 10103 12850 15 7
## 4 all combined 3 22899 24 7
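The 1% sampling step above could be sketched as follows. This is only an illustration of the idea; the report’s actual logic lives in sample_capstone_data in sample_data.R, and the function name and seed here are my own assumptions.

```r
# Hypothetical sketch of the 1% line sampling (the real implementation is
# sample_capstone_data in sample_data.R). Each line is kept independently
# with probability `rate`; a fixed seed keeps reruns reproducible.
sample_lines <- function(path, rate = 0.01, seed = 1234) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(lines), size = 1, prob = rate) == 1
  lines[keep]
}
```

Sampling per line (rather than taking a contiguous block) keeps the sample representative of the whole file.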
We perform a number of text preprocessing steps prior to parsing n-grams.
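A representative tm preprocessing pipeline is sketched below. These particular steps (lowercasing, punctuation and number removal, whitespace collapsing) are my assumption of typical cleaning, not necessarily the exact steps used by this report’s code.

```r
library(tm)

# Assumed preprocessing pipeline; the report's actual steps may differ.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))  # normalize case
  corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
  corpus <- tm_map(corpus, removeNumbers)                 # drop digits
  corpus <- tm_map(corpus, stripWhitespace)               # collapse spaces
  corpus
}
```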
For example, look at the word frequency distribution for the Twitter sample data:
p <- twitter_word_plot(sample_vector_corpus)
print(p)
Here are the top bigrams.
p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=2)
print(p)
Here are the top trigrams.
p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=3)
print(p)
Here are the top 4-grams.
p <- ngrams_per_source_plot(sample_vector_corpus, num_gram=4)
print(p)
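The n-gram counting behind these plots can be illustrated with a minimal base-R tokenizer. This is a sketch of the general technique, not the tokenizer used by ngrams_per_source_plot.

```r
# Toy n-gram counter: split text into words, then count each run of n
# consecutive words. Illustrative only; the report's tokenizer is not shown.
count_ngrams <- function(text, n = 2) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]                       # drop empty tokens
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)               # most frequent first
}

count_ngrams("the cat sat on the cat", n = 2)  # "the cat" appears twice
```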
We build a tree using the n-grams and compute maximum likelihood estimates (MLE) using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let’s perform a search for “data”.
docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)
ngram_tree <- ngram_language_modeling(docs)
plot_tree_for_report(ngram_tree)
Here are the maximum likelihood estimates. They show a 6.1% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.
results <- perform_search(ngram_tree, c("data"))
print(results)
## 12 10
## recommended_words "entry" "streams"
## likelihood "0.0606060606060606" "0.0505050505050505"
## 8 7
## recommended_words "recovery" "dating"
## likelihood "0.0404040404040404" "0.0353535353535354"
## 7
## recommended_words "personalize"
## likelihood "0.0353535353535354"
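The likelihoods above are simple bigram-to-unigram count ratios, which can be checked directly:

```r
# MLE for P(next_word | history) as a count ratio:
# count(history + next_word) / count(history)
mle_next_word <- function(ngram_count, history_count) {
  ngram_count / history_count
}

mle_next_word(12, 198)  # ~0.0606, matching the "entry" likelihood above
```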
Then, if we query for “data entry”, we traverse the tree through the nodes “data” and then “entry”, and we recommend the words “just” and “respond”.
plot_tree_for_report(ngram_tree, highlight_child = TRUE)
results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
## 6 6
## recommended_words "just" "respond"
## likelihood "0.5" "0.5"
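Conceptually, the two-word query walks the tree one token at a time and then ranks the children of the final node. A toy version using nested lists (not the data.tree structure used in the report) shows the mechanics:

```r
# Toy prefix tree: each node holds a count and a list of children.
tree <- list(data = list(count = 198, children = list(
  entry = list(count = 12, children = list(
    just    = list(count = 6, children = list()),
    respond = list(count = 6, children = list())
  ))
)))

# Walk the tree along the query words, then rank children by MLE.
search_tree <- function(tree, query) {
  node <- list(children = tree)
  for (w in query) {
    node <- node$children[[w]]
    if (is.null(node)) return(NULL)   # unseen history
  }
  counts <- vapply(node$children, function(ch) ch$count, numeric(1))
  sort(counts / node$count, decreasing = TRUE)
}

search_tree(tree, c("data", "entry"))  # just and respond, 0.5 each
```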